OcrV1, Main, Exploration, bibRecord, 000A63

Using topic models for OCR correction

Internal identifier: 000A63 (Main/Exploration); previous: 000A62; next: 000A64

Authors: Faisal Farooq [United States]; Anurag Bhardwaj [United States]; Venugopal Govindaraju [United States]

Source:

International journal on document analysis and recognition (Print) [ISSN 1433-2833]; 2009.

RBID : Pascal:10-0180626

French descriptors

Pascal (Inist)
- Reconnaissance caractère, Reconnaissance optique caractère, Analyse documentaire, Reconnaissance forme, Caractère manuscrit, Mot, Langage naturel, Base de données, Vocabulaire, Lexique, Gestion document, Catégorisation, Modélisation, Méthode entropie maximum.
Wicri :
- topic : Base de données.

English descriptors

KwdEn :
- Categorization, Character recognition, Database, Document analysis, Document management, Lexicon, Manuscript character, Method of maximum entropy, Modeling, Natural language, Optical character recognition, Pattern recognition, Vocabulary, Word.

Abstract

Despite several decades of research in document analysis, recognition of unconstrained handwritten documents is still considered a challenging task. Previous research in this area has shown that word recognizers perform adequately on constrained handwritten documents which typically use a restricted vocabulary (lexicon). But in the case of unconstrained handwritten documents, state-of-the-art word recognition accuracy is still below the acceptable limits. The objective of this research is to improve word recognition accuracy on unconstrained handwritten documents by applying a post-processing or OCR correction technique to the word recognition output. In this paper, we present two different methods for this purpose. First, we describe a lexicon reduction-based method by topic categorization of handwritten documents which is used to generate smaller topic-specific lexicons for improving the recognition accuracy. Second, we describe a method which uses topic-specific language models and a maximum-entropy based topic categorization model to refine the recognition output. We present the relative merits of each of these methods and report results on the publicly available IAM database.
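
As a rough illustration of the first method described in the abstract (topic categorization followed by a smaller topic-specific lexicon), the toy Python sketch below may help. Everything in it is an assumption made for illustration: the `TOPIC_LEXICONS` table, the function names `guess_topic` and `correct_with_topic_lexicon`, and the sample words are hypothetical. Topic selection uses a simple word-overlap count rather than the maximum-entropy model mentioned in the abstract, and string similarity from Python's `difflib` stands in for a real word recognizer's scores. It is not the authors' implementation.

```python
from difflib import get_close_matches

# Hypothetical topic-specific lexicons (illustrative only, not from the paper).
TOPIC_LEXICONS = {
    "sports": ["match", "goal", "team", "player", "score"],
    "finance": ["market", "stock", "bank", "price", "share"],
}


def guess_topic(words, lexicons):
    """Pick the topic whose lexicon contains the most recognized words.

    Assumption: a plain overlap count replaces a trained topic classifier.
    """
    overlap = {
        topic: sum(word in lexicon for word in words)
        for topic, lexicon in lexicons.items()
    }
    return max(overlap, key=overlap.get)


def correct_with_topic_lexicon(words, lexicons):
    """Replace each word with its closest match in the chosen topic lexicon.

    Words with no sufficiently similar lexicon entry are left unchanged.
    """
    topic = guess_topic(words, lexicons)
    lexicon = lexicons[topic]
    corrected = []
    for word in words:
        match = get_close_matches(word, lexicon, n=1, cutoff=0.6)
        corrected.append(match[0] if match else word)
    return topic, corrected


# Simulated noisy recognizer output: two words are misrecognized.
noisy = ["team", "gaol", "scorc", "player"]
topic, corrected = correct_with_topic_lexicon(noisy, TOPIC_LEXICONS)
print(topic, corrected)
```

The point of the sketch is only that restricting candidates to one topic's lexicon shrinks the search space for each word; how well this works on real handwriting depends on the recognizer and the topic model, which the sketch does not attempt to reproduce.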

Links toward previous steps (curation, corpus...)

to stream PascalFrancis, to step Corpus: 000196
to stream PascalFrancis, to step Curation: 000581
to stream PascalFrancis, to step Checkpoint: 000178
to stream Main, to step Merge: 000A72
to stream Main, to step Curation: 000A63

The document in XML format

<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en" level="a">Using topic models for OCR correction</title>
<author><name sortKey="Farooq, Faisal" sort="Farooq, Faisal" uniqKey="Farooq F" first="Faisal" last="Farooq">Faisal Farooq</name>
<affiliation wicri:level="2"><inist:fA14 i1="01"><s1>Image and Knowledge Management, Siemens Medical Solutions</s1>
<s2>Malvern, PA</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><region type="state">Pennsylvanie</region>
</placeName>
</affiliation>
</author>
<author><name sortKey="Bhardwaj, Anurag" sort="Bhardwaj, Anurag" uniqKey="Bhardwaj A" first="Anurag" last="Bhardwaj">Anurag Bhardwaj</name>
<affiliation wicri:level="4"><inist:fA14 i1="02"><s1>Department of Computer Science and Engineering, University at Buffalo</s1>
<s2>Buffalo, NY</s2>
<s3>USA</s3>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><region type="state">État de New York</region>
<settlement type="city">Buffalo (New York)</settlement>
</placeName>
<orgName type="university">Université d'État de New York à Buffalo</orgName>
</affiliation>
</author>
<author><name sortKey="Govindaraju, Venu" sort="Govindaraju, Venu" uniqKey="Govindaraju V" first="Venu" last="Govindaraju">Venugopal Govindaraju</name>
<affiliation wicri:level="4"><inist:fA14 i1="02"><s1>Department of Computer Science and Engineering, University at Buffalo</s1>
<s2>Buffalo, NY</s2>
<s3>USA</s3>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><region type="state">État de New York</region>
<settlement type="city">Buffalo (New York)</settlement>
</placeName>
<orgName type="university">Université d'État de New York à Buffalo</orgName>
<placeName><settlement type="city">Buffalo (New York)</settlement>
<region type="state">État de New York</region>
</placeName>
<orgName type="university" n="3">Université d'État de New York à Buffalo</orgName>
<orgName type="institution">Université d'État de New York</orgName>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">INIST</idno>
<idno type="inist">10-0180626</idno>
<date when="2009">2009</date>
<idno type="stanalyst">PASCAL 10-0180626 INIST</idno>
<idno type="RBID">Pascal:10-0180626</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000196</idno>
<idno type="wicri:Area/PascalFrancis/Curation">000581</idno>
<idno type="wicri:Area/PascalFrancis/Checkpoint">000178</idno>
<idno type="wicri:doubleKey">1433-2833:2009:Farooq F:using:topic:models</idno>
<idno type="wicri:Area/Main/Merge">000A72</idno>
<idno type="wicri:Area/Main/Curation">000A63</idno>
<idno type="wicri:Area/Main/Exploration">000A63</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en" level="a">Using topic models for OCR correction</title>
<author><name sortKey="Farooq, Faisal" sort="Farooq, Faisal" uniqKey="Farooq F" first="Faisal" last="Farooq">Faisal Farooq</name>
<affiliation wicri:level="2"><inist:fA14 i1="01"><s1>Image and Knowledge Management, Siemens Medical Solutions</s1>
<s2>Malvern, PA</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><region type="state">Pennsylvanie</region>
</placeName>
</affiliation>
</author>
<author><name sortKey="Bhardwaj, Anurag" sort="Bhardwaj, Anurag" uniqKey="Bhardwaj A" first="Anurag" last="Bhardwaj">Anurag Bhardwaj</name>
<affiliation wicri:level="4"><inist:fA14 i1="02"><s1>Department of Computer Science and Engineering, University at Buffalo</s1>
<s2>Buffalo, NY</s2>
<s3>USA</s3>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><region type="state">État de New York</region>
<settlement type="city">Buffalo (New York)</settlement>
</placeName>
<orgName type="university">Université d'État de New York à Buffalo</orgName>
</affiliation>
</author>
<author><name sortKey="Govindaraju, Venu" sort="Govindaraju, Venu" uniqKey="Govindaraju V" first="Venu" last="Govindaraju">Venugopal Govindaraju</name>
<affiliation wicri:level="4"><inist:fA14 i1="02"><s1>Department of Computer Science and Engineering, University at Buffalo</s1>
<s2>Buffalo, NY</s2>
<s3>USA</s3>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><region type="state">État de New York</region>
<settlement type="city">Buffalo (New York)</settlement>
</placeName>
<orgName type="university">Université d'État de New York à Buffalo</orgName>
<placeName><settlement type="city">Buffalo (New York)</settlement>
<region type="state">État de New York</region>
</placeName>
<orgName type="university" n="3">Université d'État de New York à Buffalo</orgName>
<orgName type="institution">Université d'État de New York</orgName>
</affiliation>
</author>
</analytic>
<series><title level="j" type="main">International journal on document analysis and recognition : (Print)</title>
<title level="j" type="abbreviated">Int. j. doc. anal. recognit. : (Print)</title>
<idno type="ISSN">1433-2833</idno>
<imprint><date when="2009">2009</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
<seriesStmt><title level="j" type="main">International journal on document analysis and recognition : (Print)</title>
<title level="j" type="abbreviated">Int. j. doc. anal. recognit. : (Print)</title>
<idno type="ISSN">1433-2833</idno>
</seriesStmt>
</fileDesc>
<profileDesc><textClass><keywords scheme="KwdEn" xml:lang="en"><term>Categorization</term>
<term>Character recognition</term>
<term>Database</term>
<term>Document analysis</term>
<term>Document management</term>
<term>Lexicon</term>
<term>Manuscript character</term>
<term>Method of maximum entropy</term>
<term>Modeling</term>
<term>Natural language</term>
<term>Optical character recognition</term>
<term>Pattern recognition</term>
<term>Vocabulary</term>
<term>Word</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr"><term>Reconnaissance caractère</term>
<term>Reconnaissance optique caractère</term>
<term>Analyse documentaire</term>
<term>Reconnaissance forme</term>
<term>Caractère manuscrit</term>
<term>Mot</term>
<term>Langage naturel</term>
<term>Base de données</term>
<term>Vocabulaire</term>
<term>Lexique</term>
<term>Gestion document</term>
<term>Catégorisation</term>
<term>Modélisation</term>
<term>Méthode entropie maximum</term>
</keywords>
<keywords scheme="Wicri" type="topic" xml:lang="fr"><term>Base de données</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">Despite several decades of research in document analysis, recognition of unconstrained handwritten documents is still considered a challenging task. Previous research in this area has shown that word recognizers perform adequately on constrained handwritten documents which typically use a restricted vocabulary (lexicon). But in the case of unconstrained handwritten documents, state-of-the-art word recognition accuracy is still below the acceptable limits. The objective of this research is to improve word recognition accuracy on unconstrained handwritten documents by applying a post-processing or OCR correction technique to the word recognition output. In this paper, we present two different methods for this purpose. First, we describe a lexicon reduction-based method by topic categorization of handwritten documents which is used to generate smaller topic-specific lexicons for improving the recognition accuracy. Second, we describe a method which uses topic-specific language models and a maximum-entropy based topic categorization model to refine the recognition output. We present the relative merits of each of these methods and report results on the publicly available IAM database.</div>
</front>
</TEI>
<affiliations><list><country><li>États-Unis</li>
</country>
<region><li>Pennsylvanie</li>
<li>État de New York</li>
</region>
<settlement><li>Buffalo (New York)</li>
</settlement>
<orgName><li>Université d'État de New York</li>
<li>Université d'État de New York à Buffalo</li>
</orgName>
</list>
<tree><country name="États-Unis"><region name="Pennsylvanie"><name sortKey="Farooq, Faisal" sort="Farooq, Faisal" uniqKey="Farooq F" first="Faisal" last="Farooq">Faisal Farooq</name>
</region>
<name sortKey="Bhardwaj, Anurag" sort="Bhardwaj, Anurag" uniqKey="Bhardwaj A" first="Anurag" last="Bhardwaj">Anurag Bhardwaj</name>
<name sortKey="Govindaraju, Venu" sort="Govindaraju, Venu" uniqKey="Govindaraju V" first="Venu" last="Govindaraju">Venugopal Govindaraju</name>
</country>
</tree>
</affiliations>
</record>
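
A minimal Python sketch of reading keyword terms out of such a record follows. It is a hedged example, not part of the Dilib toolchain: the function name `terms_by_scheme` and the `EXCERPT` string are hypothetical, and the excerpt is a shortened, well-formed copy of the `<textClass>` block above. The full record as displayed uses `wicri:` and `inist:` prefixes without visible namespace declarations, so a strict XML parser would likely reject it unless those prefixes are declared first.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed excerpt of the <textClass> block shown above.
EXCERPT = """
<textClass>
  <keywords scheme="KwdEn" xml:lang="en">
    <term>Categorization</term>
    <term>Character recognition</term>
    <term>Optical character recognition</term>
  </keywords>
  <keywords scheme="Wicri" type="topic" xml:lang="fr">
    <term>Base de données</term>
  </keywords>
</textClass>
"""


def terms_by_scheme(xml_text):
    """Map each <keywords> scheme attribute to the list of its <term> strings."""
    root = ET.fromstring(xml_text.strip())
    return {
        kw.get("scheme"): [term.text.strip() for term in kw.findall("term")]
        for kw in root.iter("keywords")
    }


keywords = terms_by_scheme(EXCERPT)
print(keywords["KwdEn"])
```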

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration

HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000A63 | SxmlIndent | more

Or:

HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 000A63 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     Pascal:10-0180626
   |texte=   Using topic models for OCR correction
}}

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024

	OCR exploration server
	Warning: this site is under development! It was generated by automated means from raw corpora, so the information has not been validated.
